I enjoyed the soft landing on this course. It was comforting to start
and go through the very basics.
From this course, I expect to learn to integrate the use of GitHub and R
Markdown into my daily work. I also expect to learn best practices in
open data science. I heard about this course through
the email list of my doctoral program.
So far the book and exercises have been a great crash course in R tools and RStudio. I enjoy the writing style of the book. My favorite section so far is Chapter 4, which focuses on plots; I found it useful.
My GitHub repository: https://github.com/kattilakoski/IODS-project
The dataset used in this assignment is from a survey of the learning approaches and achievements of students on an introductory statistics course. In this assignment we use a subset of the results, including information on the gender, age, attitude, total points and learning approaches (deep, surface and strategic) of 166 students.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(tidyverse)
#Read the data in:
learning2014 <- read_csv("./data/learning2014.csv")
#Explore dimensions:
dim(learning2014)
## [1] 166 7
#and structure
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
In the first scatter plot matrix each variable is plotted against every other variable. The variable names run along the diagonal of the plot. In the plots on the same row as a variable's text box, that variable is on the y-axis; in the plots in the same column, it is on the x-axis.
Based on this scatter plot matrix there are no strong, clearly visible correlations between the variables. However, some pairs show an upward or downward trend, such as attitude and points, and the deep and surface approaches.
In the second plot matrix, created with ggpairs(), the variable names are on the top and left side of the plot. The upper right triangle shows the correlations between the continuous variables (age, attitude, deep, strategic, surface, points), the lower left triangle shows their scatter plots, the diagonal shows their density plots, and the first row and column show the histograms and box plots for the combinations of the categorical gender variable with the continuous variables.
From the matrix we can see that the majority of the respondents were female and around 20 years old. Age did not correlate significantly with any other variable. Attitude had a significant positive correlation with points and a significant negative correlation with the surface approach. The deep approach had a significant negative correlation with the surface approach, as did the strategic approach. This matrix confirmed the trends that were visible in the previous scatter plot matrix.
# create a regression model with three explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# create a regression model without the variables that did not have a statistically significant relationship with points
my_model3 <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Based on the t-tests reported in the linear regression summary, the only significant relationship is the positive one between attitude and points (p < 0.001). Here the t-test checks whether each coefficient estimate differs from zero, i.e. whether the explanatory variable has a relationship with the target variable. As both the strategic and surface variables had p-values greater than 0.1, they did not have a statistically significant relationship with the points variable.
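As a quick sanity check on how these t-tests arise: the t value reported by summary(lm()) is simply the coefficient estimate divided by its standard error. A self-contained sketch on simulated data (the variables here are made up for illustration, not the course data):

```r
# The "t value" column of summary(lm()) equals Estimate / Std. Error
set.seed(42)
d <- data.frame(x = rnorm(60))
d$y <- 2 * d$x + rnorm(60)
coefs <- summary(lm(y ~ x, data = d))$coefficients
stopifnot(all.equal(unname(coefs[, "t value"]),
                    unname(coefs[, "Estimate"] / coefs[, "Std. Error"])))
```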
The multiple R-squared of the first regression model, with all three explanatory variables, is 0.2074. This means that only about a fifth of the variability in points is explained by these explanatory variables. Even though the relationship between attitude and points is significant, the model including only attitude has a smaller multiple R-squared of 0.1906. This is expected, as dropping explanatory variables can only decrease the multiple R-squared; attitude on its own still accounts for most of the explained variability in points.
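For reference, multiple R-squared is one minus the residual sum of squares over the total sum of squares; a minimal check on simulated data (not the learning2014 data):

```r
# R-squared = 1 - SSE/SST, matching summary(fit)$r.squared
set.seed(1)
y <- rnorm(100)
x <- y + rnorm(100)
fit <- lm(y ~ x)
r2_manual <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
stopifnot(all.equal(r2_manual, summary(fit)$r.squared))
```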
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5 for Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow = c(2,2))
plot(my_model2, c(1,2,5))
#same for the model with only attitude as the explanatory variable
par(mfrow = c(2,2))
plot(my_model3, c(1,2,5))
There is no distinctive pattern in the Residuals vs Fitted plots in either case, which suggests that the relationship between the explanatory and target variables is reasonably linear.
The normal Q-Q plots show fairly straight lines following the dotted reference line, with deviations at both ends. These deviations may indicate that the residuals do not follow a normal distribution perfectly.
In the Residuals vs Leverage plots, dotted lines would appear in the upper and lower right corners if some observations had high Cook’s distance scores, meaning they are influential to the regression results and do not follow the trend of the rest of the cases. In these plots none of the cases seem to be very influential to the regression results.
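Influence can also be checked numerically with cooks.distance(); a sketch on simulated data, flagging points with the common (though not universal) 4/n rule of thumb rather than the thresholds drawn by plot():

```r
# Flag observations whose Cook's distance exceeds 4/n (a rule-of-thumb cutoff)
set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
cd <- cooks.distance(fit)
influential <- which(cd > 4 / nrow(d))
```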
#install packages:
# install.packages("boot")
# install.packages("readr")
# access libraries
library(dplyr); library(ggplot2); library(readr); library(tidyr); library(boot)
## Warning: package 'boot' was built under R version 4.2.3
#Read the joined student alcohol consumption data into R
alc <- read_csv("./data/student_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print out the names of the variables in the data
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data set includes data on student achievements in two Portuguese schools. It includes 370 observations and 35 variables. Variables include for example student grades, sex and age. The dataset was formed by combining two datasets, one regarding the student performance in mathematics and the other in Portuguese language.
### Choosing variables for this analysis

For this analysis I chose to focus on age, mother’s education, study time and absences. My hypotheses are that younger students drink less than older ones, that the higher the mother’s education level the lower the student’s alcohol consumption, that students who study more drink less than students who study less, and that the more absences a student has the more they drink.
#distribution of alcohol consumption and age
d1 <- ggplot(alc, aes(x = high_use, y = age)) + geom_boxplot() + ggtitle("Age and alcohol consumption")
d1
#distribution of alcohol consumption and mother's education
d2 <- ggplot(alc, aes(x = high_use, y = Medu)) + geom_boxplot() + ggtitle("Mother's education and student alcohol consumption")
d2
#distribution of alcohol consumption and study time
d3 <- ggplot(alc, aes(x = high_use, y = studytime)) + geom_boxplot() + ggtitle("Studytime and alcohol consumption")
d3
#distribution of alcohol consumption and absences
d4 <- ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Absences and alcohol consumption")
d4
#Summary statistics
#age
# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), mean(age))
## # A tibble: 2 × 3
## high_use count `mean(age)`
## <lgl> <int> <dbl>
## 1 FALSE 259 16.5
## 2 TRUE 111 16.8
#mother's education
alc %>% group_by(high_use) %>% summarise(count = n(), mean(Medu))
## # A tibble: 2 × 3
## high_use count `mean(Medu)`
## <lgl> <int> <dbl>
## 1 FALSE 259 2.80
## 2 TRUE 111 2.79
#study time
alc %>% group_by(high_use) %>% summarise(count = n(), mean(studytime))
## # A tibble: 2 × 3
## high_use count `mean(studytime)`
## <lgl> <int> <dbl>
## 1 FALSE 259 2.16
## 2 TRUE 111 1.77
#absences
alc %>% group_by(high_use) %>% summarise(count = n(), mean(absences))
## # A tibble: 2 × 3
## high_use count `mean(absences)`
## <lgl> <int> <dbl>
## 1 FALSE 259 3.71
## 2 TRUE 111 6.38
Based on the distribution plots my hypothesis on age looks correct: younger students drink less than older students. In the numerical summary, however, the difference in mean age is only 0.27 years, which seems small. Against my hypothesis, there appears to be no difference in the alcohol use of students whose mothers have different education levels; this is also visible in the summary, as the difference between the mean education levels of the mothers is only 0.01. The distributions support my hypothesis on study time, as the students who drink less seem to study more, and the summary backs this up, although the difference in mean study time between the high and low alcohol use groups is only 0.40. My hypothesis about absences also seems right, as students who drink more have more absences. This difference is the clearest: students with high alcohol use have on average 2.67 more absences than students with low alcohol use.
# find the model with glm()
m <- glm(high_use ~ age + Medu + studytime + absences, data = alc, family = "binomial")
# print out a summary of the fitted model
summary(m)
##
## Call:
## glm(formula = high_use ~ age + Medu + studytime + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2166 -0.8430 -0.6547 1.1573 2.2772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.798207 1.786727 -1.566 0.117323
## age 0.164398 0.103533 1.588 0.112316
## Medu -0.003725 0.111021 -0.034 0.973232
## studytime -0.579521 0.158539 -3.655 0.000257 ***
## absences 0.075027 0.023247 3.227 0.001249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 417.24 on 365 degrees of freedom
## AIC: 427.24
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) age Medu studytime absences
## -2.798207282 0.164397602 -0.003725333 -0.579521121 0.075027298
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.06091918 0.001766207 1.979230
## age 1.17868287 0.963461160 1.447356
## Medu 0.99628160 0.801834321 1.240253
## studytime 0.56016655 0.406319506 0.757511
## absences 1.07791358 1.032143878 1.130799
Based on the negative coefficient estimates in the summary of the fitted model, higher values of mother’s education and study time are associated with a lower likelihood of the high_use variable being TRUE, and the opposite holds for age and absences. However, only the coefficients for study time and absences are statistically significant. From the model summary one could therefore say that the more time a student spends studying, the lower the likelihood of high alcohol use, and the more absence days a student has, the higher the likelihood of high alcohol use. Based on these results my hypotheses about mother’s education and age were not supported, while the hypotheses on study time and absences were.
Based on the odds ratio for age, a one-year increase in age multiplies the odds of high alcohol use by 1.18 when the other variables are held constant. The same logic applies to the other variables: each additional level of mother’s education multiplies the odds by 1.00 (i.e. essentially no change), each additional unit of study time multiplies the odds by 0.56, and each additional absence multiplies the odds by 1.08. The largest changes in the odds are thus associated with age (higher age, higher odds of high use) and study time (more studying, lower odds). The 95% confidence intervals of the odds ratios for the intercept, age, Medu, studytime and absences are 0.00-1.98, 0.96-1.45, 0.80-1.24, 0.41-0.76 and 1.03-1.13, respectively. These mean we are 95% confident that the true odds ratio for the given variable lies within the interval. For example, we are 95% confident that the odds ratio for age is between 0.96 and 1.45; since this interval is wide and includes 1, we don’t have very good knowledge of the effect of age, and the same goes for mother’s education. The confidence intervals for study time and absences are narrower and exclude 1, meaning we have better knowledge of their effect on the target variable (alcohol use). In terms of direction, the odds ratios point the same way as my hypotheses on age, study time and absences, while the hypothesis on mother’s education was not supported.
# find the model with glm() for only the statistically significant study time and absences
m <- glm(high_use ~ studytime + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, studytime, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 5
## studytime absences high_use probability prediction
## <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 2 3 FALSE 0.266 FALSE
## 2 1 0 FALSE 0.336 FALSE
## 3 1 7 TRUE 0.468 FALSE
## 4 3 1 FALSE 0.149 FALSE
## 5 1 6 FALSE 0.448 FALSE
## 6 3 2 FALSE 0.159 FALSE
## 7 2 2 FALSE 0.251 FALSE
## 8 2 3 FALSE 0.266 FALSE
## 9 1 4 TRUE 0.410 FALSE
## 10 1 2 TRUE 0.372 FALSE
#2x2 tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 93 18
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2810811
Based on the cross tabulation (confusion matrix), the model predicts the low-use cases fairly well: only 11 false positives out of the 259 students who reported low alcohol use. The model does not predict the high-use cases well, however, as out of the 111 high-use cases it got only 18 right and produced 93 false negatives.
The model has a training prediction error of 0.28, which means it makes a wrong prediction 28% of the time. This is a rather high prediction error, and I would conclude that this model does not predict very well and is therefore not very powerful.
I think the “error rate of my guesses” and that of the model were pretty similar. I would say the model is slightly better for prediction than guessing, but it has much room for improvement, for example by including other variables.
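Since the boot package is already loaded, a less optimistic error estimate than the training error could be obtained with 10-fold cross-validation via cv.glm(). A sketch on simulated data (the real analysis would pass the alc data frame and the fitted model m instead of the sim objects below):

```r
library(boot)
# 10-fold cross-validation of a logistic regression, reusing the same loss function
set.seed(3)
sim <- data.frame(x = rnorm(200))
sim$y <- rbinom(200, 1, plogis(sim$x))
m_sim <- glm(y ~ x, data = sim, family = "binomial")
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
cv <- cv.glm(data = sim, cost = loss_func, glmfit = m_sim, K = 10)
cv$delta[1]  # average proportion of wrong predictions on held-out folds
```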
The dataset used in this assignment and the explanations of the variable names can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html The dataset has 506 rows and 14 columns, and it includes information on housing values in the suburbs of Boston. The columns include, for example, the per capita crime rate by town (crim) and the average number of rooms per dwelling (rm).
# access the MASS and GGally libraries and load the Boston data
library(MASS); library(GGally)
data("Boston")
# Show a graphical overview of the data
pairs(Boston)
p <- ggpairs(Boston)
# draw the plot
p
# Show summaries of the variables in the data.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# access the corrplot library and visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Based on the plot matrix, most variables appear to correlate with each other to some degree, with the exception of chas (the Charles River dummy variable, = 1 if the tract bounds the river and 0 otherwise), which hardly correlates with anything or correlates more weakly than the other variables; chas only takes the values 0 and 1. From the density plots we can see, for example, that crim, zn, chas, dis, rad, tax and lstat are concentrated at low values (although some have a second, smaller peak), only rm looks approximately normally distributed, indus has two peaks of similar size, and age, ptratio and black are concentrated at higher values.
The same trends show in the corrplot, where the strength (color intensity) and sign of the correlation (blue for positive, red for negative) are more easily visible. The strongest negative correlations (when one variable gets higher, the other gets lower) seem to be between indus and dis, nox and dis, age and dis, and lstat and medv. The strongest positive correlation (both get higher together) seems to be between rad and tax.
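Whether an individual correlation is statistically significant can be checked with cor.test(); with the real data one would pass, for example, Boston$lstat and Boston$medv. A self-contained sketch on simulated data:

```r
# cor.test() returns the correlation estimate and a p-value for H0: cor = 0
set.seed(4)
a <- rnorm(100)
b <- -a + rnorm(100)
ct <- cor.test(a, b)
stopifnot(ct$estimate < 0, ct$p.value < 0.05)
```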
# Standardize the dataset
boston_scaled <- scale(Boston)
# Print out summaries of the scaled data
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime', using the quantiles as the break points in the categorical variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
As a result of the scaling the variable values got smaller and the mean of each variable is now zero. This is because in the scaling the column mean is subtracted from each value in the column and the difference is divided by the column standard deviation.
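The column-wise formula behind scale() can be verified directly; a small check on a toy vector:

```r
# scale(x) computes (x - mean(x)) / sd(x) column by column
x <- c(2, 4, 6, 10)
manual <- (x - mean(x)) / sd(x)
stopifnot(all.equal(as.numeric(scale(x)), manual))
```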
# Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2425743 0.2524752 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 0.92386255 -0.9303368 -0.086616792 -0.8885408 0.43381949 -0.9018561
## med_low -0.07793854 -0.2899302 0.008892378 -0.5815612 -0.05495735 -0.3841604
## med_high -0.38551208 0.1504647 0.190859195 0.3898722 0.07049683 0.4169570
## high -0.48724019 1.0149946 -0.071456607 1.0287840 -0.40963712 0.8134087
## dis rad tax ptratio black lstat
## low 0.9061102 -0.6795860 -0.7226524 -0.4431092 0.37490822 -0.76922223
## med_low 0.3448336 -0.5541258 -0.4972708 -0.0732558 0.31707026 -0.23826018
## med_high -0.4039108 -0.4369123 -0.3510712 -0.4119310 0.06243063 0.02722113
## high -0.8469889 1.6596029 1.5294129 0.8057784 -0.80118228 0.88109264
## medv
## low 0.49966767
## med_low 0.05656725
## med_high 0.22625809
## high -0.68090978
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.125599985 0.59446583 -0.89215753
## indus 0.070700524 -0.26499149 0.52216154
## chas -0.005284704 -0.03480417 0.05924504
## nox 0.223676870 -0.69971992 -1.33571643
## rm 0.042317382 0.03650324 -0.09157298
## age 0.233341233 -0.31992232 -0.10605213
## dis -0.166230845 -0.12710158 0.10333544
## rad 4.003043722 0.77843541 -0.09673008
## tax 0.091155457 0.19463772 0.39828381
## ptratio 0.144593246 0.09669847 -0.15830474
## black -0.116625865 0.05504301 0.12884670
## lstat 0.182004629 -0.35112742 0.18140544
## medv 0.053250862 -0.50730387 -0.24295114
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9638 0.0293 0.0069
#Draw the LDA (bi)plot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
# Save the crime categories from the test set
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Then predict the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results with the crime categories from the test set.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 5 2 0
## med_low 2 19 7 0
## med_high 0 9 12 3
## high 0 0 1 28
#Comment on the results.
Based on the cross tabulation the LDA model worked best on the high crime rate category, predicting 28 of its 29 test cases correctly. Of the low crime rates the model predicted 14 out of 21 right, of the med_low category 19 out of 28, and of the med_high category 12 out of 24. (The exact counts depend on the random train/test split.)
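The overall accuracy of the LDA predictions can be computed directly from the cross tabulation as the sum of the diagonal over the total; the counts below are copied from the table above:

```r
# Accuracy = correctly classified / all test observations
tab <- matrix(c(14,  5,  2,  0,
                 2, 19,  7,  0,
                 0,  9, 12,  3,
                 0,  0,  1, 28),
              nrow = 4, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 2)  # about 0.72 for this split
```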
# Reload the Boston dataset
data("Boston")
# Standardize the dataset to get comparable distances
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# Calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Run k-means algorithm on the dataset.
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
#Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line() # here it seems 2 is an optimal number of clusters
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
#p <- ggpairs(boston_scaled, col = km$cluster)
#p
The optimal number of clusters can be read from the plot at the point where the total within-cluster sum of squares drops sharply and then levels off (the “elbow”). In this case that happens at around 2, so 2 is the optimal number of clusters.
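The elbow choice can be made slightly more concrete by looking at how much each additional cluster reduces the total WCSS; a heuristic sketch using a built-in dataset as a stand-in for boston_scaled:

```r
# Gain in total WCSS from each additional cluster; it collapses after the elbow
set.seed(123)
X <- scale(iris[, 1:4])  # stand-in data for illustration
twcss <- sapply(1:10, function(k) kmeans(X, k, nstart = 10)$tot.withinss)
drops <- -diff(twcss)    # twcss[k] - twcss[k + 1]
which.max(drops) + 1     # number of clusters after the largest single gain
```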
# Standardize the dataset
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
# add the cluster value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)
# Perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model.
# linear discriminant analysis
lda.fit <- lda(km.cluster ~ ., data = boston_scaled[,-4]) # lda() raised an error about variable 4 (chas) being constant within groups, so I removed that column. I am not sure this is the right way to handle it, but it avoids the error
# print the lda.fit object
lda.fit
## Call:
## lda(km.cluster ~ ., data = boston_scaled[, -4])
##
## Prior probabilities of groups:
## 1 2 3
## 0.06916996 0.61067194 0.32015810
##
## Group means:
## crim zn indus nox rm age dis
## 1 -0.2048299 -0.1564737 0.2306535 0.3342374 0.3344149 0.3170678 -0.3634565
## 2 -0.3882449 0.2731699 -0.6264383 -0.5823006 0.2188304 -0.4585819 0.4807157
## 3 0.7847946 -0.4872402 1.1450405 1.0384727 -0.4896488 0.8062002 -0.8383961
## rad tax ptratio black lstat medv
## 1 -0.02700292 -0.1304164 -0.4453253 0.1787986 -0.1976385 0.6422884
## 2 -0.58641200 -0.6161585 -0.2814183 0.3151747 -0.4640135 0.3182241
## 3 1.12436056 1.2034416 0.6329916 -0.6397959 0.9277624 -0.7457491
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.02479166 -0.13204141
## zn 0.42787622 -0.04638198
## indus 1.15646011 0.64348753
## nox 0.47943272 0.42987408
## rm 0.13637610 -0.12823804
## age -0.06654278 0.30029385
## dis -0.01915297 0.20848367
## rad 0.74637979 0.69845574
## tax 0.27967651 -1.02040695
## ptratio 0.19355485 -0.21964359
## black -0.04753224 0.11581547
## lstat 0.47213016 0.02172623
## medv 0.06797263 0.92531426
##
## Proportion of trace:
## LD1 LD2
## 0.9822 0.0178
# Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results
plot(lda.fit, dimen = 2, col = boston_scaled$km.cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2)
# Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)
Based on the biplot it looks like medv, rad, indus and tax are the most influential variables, which shows in the plot as longer arrows.
#Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
# Note: lda.fit was overwritten above by the cluster LDA, which has only two discriminants, so refit the LDA on the crime categories first; with four classes it has the three discriminants the 3D plots below need
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
#Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
#kmeans for train set
# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
set.seed(123)
km_train <- kmeans(train[,-14], centers = 3)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_train$cluster)
Yes, the plots differ in that the points are colored differently, but the spatial locations of the points seem to stay the same in both plots.